ABSTRACT
Type 2 diabetes mellitus (T2D) presents a major health and economic burden that could be alleviated with improved early prediction and intervention. While standard risk factors have shown good predictive performance, we show that the use of blood-based DNA methylation information leads to a significant improvement in the prediction of 10-year T2D incidence risk. Previous studies have been largely constrained by linear assumptions, the use of CpGs one-at-a-time, and binary outcomes. We present a flexible approach (via an R package, MethylPipeR) based on a range of linear and tree-ensemble models that incorporate time-to-event data for prediction. Using the Generation Scotland cohort (training set n_cases=374, n_controls=9,461; test set n_cases=252, n_controls=4,526), our best-performing model (Area Under the Curve (AUC)=0.872, Precision Recall AUC (PRAUC)=0.302) showed notable improvement in 10-year onset prediction beyond standard risk factors (AUC=0.839, PRAUC=0.227). Replication was observed in the German-based KORA study (n=1,451, n_cases=142, p=1.6×10⁻⁵).